UBC Bioinformatics and Statistics Workshop Series

GNU Make for Reproducible Research

Presenter: Tony Liang | tony.liang@hli.ubc.ca
January 28, 2026

Objective

Learn how to use GNU Make to automate your research workflows

Who is this for?

  • Anyone tired of manually running scripts in order
  • People who want reproducible research workflows

What you’ll learn:

  • Why automation matters for research
  • How Make tracks dependencies automatically
  • Build a real data analysis pipeline

Setup GitHub Repo

If you wish to follow the hands-on tutorial, you need a GitHub account

  • Open the tutorial repository
  • Click on “Use this template”, then “Create a new repository”
  • Enter a name for your repository copy
  • Click on “Create repository”

Setup Posit Cloud

Using Posit Cloud (RStudio Server on the web)


  • Go to the Posit Cloud website
  • Click on “Sign up with GitHub”
  • Log in with your GitHub credentials when prompted

Setup Posit Cloud

  • Click on “New Project from Git Repository”
  • Paste in your GitHub repository URL
  • Run install.packages(c("tidyverse", "patchwork")) in the new workspace

Note

The GitHub URL should point to your repository, in the form https://github.com/your_user_name/repo_name.git

An example of our typical code workflow

my_full_experiment_version2.R
suppressPackageStartupMessages(library(dplyr))
library(readr)
library(broom)
library(ggplot2)
library(patchwork)

# Download data from internet
dest <- "data/iris.csv"
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
download.file(url, dest, quiet = FALSE)

# Read raw data
iris_raw <- read_csv("data/iris.csv", show_col_types = FALSE)

# Clean the data
iris_clean <- iris_raw %>%
  tidyr::drop_na() %>%              # Remove rows with missing values
  rename_with(tolower) %>%          # Standardize column names to lowercase
  mutate(id = row_number())         # Add a unique ID column

# Write cleaned data
#write_csv(iris_clean, "data/iris_clean.csv")

cat("Data cleaning complete!\n")
cat(sprintf("  Original rows: %d\n", nrow(iris_raw)))
cat(sprintf("  Cleaned rows: %d\n", nrow(iris_clean)))


# Read cleaned data
#iris_clean <- read_csv("data/iris_clean.csv")

# 1. Summary statistics by species
summary_stats <- iris_clean %>%
  group_by(species) %>%
  summarise(
    across(c(sepal_length, sepal_width, petal_length, petal_width),
           list(mean = mean, sd = sd),
           .names = "{.col}_{.fn}")
  )

# 2. Fit a simple linear model
model <- lm(sepal_length ~ petal_length + species, data = iris_clean)
model_tidy <- tidy(model)
model_glance <- glance(model)

# Create output file
sink("results/model_summary.txt")

cat(strrep("=", 60), "\n")
cat("IRIS DATASET ANALYSIS\n")
cat(strrep("=", 60), "\n\n")

cat("Summary Statistics by Species:\n")
cat(strrep("-", 60), "\n")
print(summary_stats)

cat("\n\nLinear Model Results:\n")
cat(strrep("-", 60), "\n")
cat("Model: sepal_length ~ petal_length + species\n\n")
print(model_tidy)

cat("\nModel Fit Statistics:\n")
print(model_glance)

cat("\n", strrep("=", 60), "\n")
cat("Analysis complete!\n")

sink()

cat("Analysis results saved to results/model_summary.txt\n")

# Read cleaned data
#iris_clean <- read_csv("data/iris_clean.csv", show_col_types = FALSE)



# Plot 1: Sepal Length vs Petal Length
p1 <- ggplot(iris_clean, aes(x = petal_length, y = sepal_length, color = species)) +
  geom_point(size = 3) +
  labs(
    x = "Petal Length (cm)",
    y = "Sepal Length (cm)",
    title = "Sepal vs Petal Length"
  ) +
  theme_minimal()

# Plot 2: Sepal Width vs Petal Width
p2 <- ggplot(iris_clean, aes(x = petal_width, y = sepal_width, color = species)) +
  geom_point(size = 3) +
  labs(
    x = "Petal Width (cm)",
    y = "Sepal Width (cm)",
    title = "Sepal vs Petal Width"
  ) +
  theme_minimal()

# Plot 3: Distribution of Sepal Length by Species
p3 <- ggplot(iris_clean, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Sepal Length (cm)",
    title = "Sepal Length by Species"
  ) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: Distribution of Petal Length by Species
p4 <- ggplot(iris_clean, aes(x = species, y = petal_length, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Petal Length (cm)",
    title = "Petal Length by Species"
  ) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) +
  theme_minimal() +
  theme(legend.position = "none")

# Save all plots in a 2x2 grid
joined_plot <- (p1 | p2) / (p3 | p4) +
  plot_layout(guides = "collect")  # Collect all legends into one


ggsave("figures/iris_plot.png", joined_plot, width = 12, height = 8, dpi = 150)

cat("Plot saved to figures/iris_plot.png\n")

Problems:

  • What if I want to change my analysis
  • What if one step fails halfway through?
  • What order should I run?

Decompose this to modular unit scripts

A better example of our typical code workflow

Rather than running one gigantic R script, we split it into smaller scripts:

# Step 1: Download data
Rscript download_data.R
# Step 2: Clean the data
Rscript clean_data.R
# Step 3: Run analysis
Rscript analyze.R
# Step 4: Make figures
Rscript make_figures.R

Run all four scripts in order, and the analysis is complete!

Then You Update Your Analysis Script…

  • You realize you need to fix a bug in analyze.R

Question: Which scripts need to be re-run?

  1. Just analyze.R
  2. Both analyze.R and make_figures.R
  3. All four scripts?
  4. Not sure… better run them all!
# Step 1: Download data
Rscript download_data.R
# Step 2: Clean the data
Rscript clean_data.R
# Step 3: Run analysis
Rscript analyze.R
# Step 4: Make figures
Rscript make_figures.R

Problem: Dependency Tracking

After fixing the analysis script, you must:

  1. Re-run analyze.R (it changed)
  2. Re-run make_figures.R (its input from step 1 is now newer)
  3. Skip download_data.R and clean_data.R (inputs unchanged)

But how do you remember this?

Problem: It gets worse

Real research projects have:

  • 10+ scripts running in sequence
  • Complex dependencies (Script C needs outputs from A and B)
  • Long-running steps (some take hours!)

What happens when you update the raw data?

Time to re-run and hope it works

You need to re-run EVERYTHING downstream… but you don’t yet know which scripts those are.

What is GNU Make?

Part of the GNU Project (“GNU’s Not Unix”; Stallman et al. 1998), Make is a build automation tool that:

  1. Tracks dependencies between files
  2. Runs only what’s needed when files change
  3. Documents your workflow in one place
  4. Ensures reproducibility

Originally designed for compiling software, but perfect for data analysis!

Basic Makefile Syntax

make is a command-line tool that reads and executes a Makefile

target: dependencies
    command

Example:

mean_normal_distribution.txt: sample_from_normal_distribution.R
    Rscript sample_from_normal_distribution.R

  • mean_normal_distribution.txt = target (what you want to create)
  • sample_from_normal_distribution.R = dependency (what you need)
  • Rscript sample_from_normal_distribution.R = command (how to make it)

⚠️ Important: Commands MUST start with a TAB, not spaces!
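Here is a minimal sketch of what that mistake looks like in practice (assumes GNU Make is installed; the Makefile and file names are throwaway examples, run in a scratch directory):

```shell
# Write a throwaway Makefile whose recipe line starts with 4 spaces, not a tab
cd "$(mktemp -d)"
printf 'hello.txt:\n    echo hi > hello.txt\n' > Makefile
make hello.txt 2>&1 || true
# GNU Make typically reports: Makefile:2: *** missing separator.  Stop.
```

The “missing separator” error is almost always a spaces-instead-of-tab problem.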

How Make Works

When you run make mean_normal_distribution.txt:

  1. Check whether mean_normal_distribution.txt exists
  2. Check whether sample_from_normal_distribution.R is newer than mean_normal_distribution.txt
  3. If yes → run the command
  4. If no → skip (already up-to-date!)

This is the magic! 🎩✨

Make only rebuilds what’s necessary based on file timestamps.
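A minimal, self-contained sketch of this behavior (assumes GNU Make is installed; in.txt and out.txt are toy files, run in a scratch directory):

```shell
cd "$(mktemp -d)"
# One rule: out.txt depends on in.txt (the \t is the required tab)
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > Makefile
echo "v1" > in.txt
make out.txt   # first run: executes cp in.txt out.txt
make out.txt   # second run: reports that out.txt is up to date, does nothing
touch in.txt   # bump the dependency's timestamp
make out.txt   # the dependency is now newer, so the command runs again
```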

Real World Usage

Adapted from https://www.evan-soil.io/blog/ode-to-gnu-make/

  • Ideally, you want to track how every figure in your manuscript is created
  • You want to save time by re-running as little code as possible

Real Example: Data Analysis Pipeline

my_full_analysis.R
suppressPackageStartupMessages(library(dplyr))
library(readr)
library(broom)
library(ggplot2)
library(patchwork)


# Download data from internet
dest <- "data/iris.csv"
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
download.file(url, dest, quiet = FALSE)

# Read raw data
iris_raw <- read_csv("data/iris.csv", show_col_types = FALSE)

# Clean the data
iris_clean <- iris_raw %>%
  tidyr::drop_na() %>%              # Remove rows with missing values
  rename_with(tolower) %>%          # Standardize column names to lowercase
  mutate(id = row_number())         # Add a unique ID column

# Write cleaned data
#write_csv(iris_clean, "data/iris_clean.csv")

cat("Data cleaning complete!\n")
cat(sprintf("  Original rows: %d\n", nrow(iris_raw)))
cat(sprintf("  Cleaned rows: %d\n", nrow(iris_clean)))


# Read cleaned data
#iris_clean <- read_csv("data/iris_clean.csv")

# 1. Summary statistics by species
summary_stats <- iris_clean %>%
  group_by(species) %>%
  summarise(
    across(c(sepal_length, sepal_width, petal_length, petal_width),
           list(mean = mean, sd = sd),
           .names = "{.col}_{.fn}")
  )

# 2. Fit a simple linear model
model <- lm(sepal_length ~ petal_length + species, data = iris_clean)
model_tidy <- tidy(model)
model_glance <- glance(model)


# Create output file
sink("results/model_summary.txt")

cat(strrep("=", 60), "\n")
cat("IRIS DATASET ANALYSIS\n")
cat(strrep("=", 60), "\n\n")

cat("Summary Statistics by Species:\n")
cat(strrep("-", 60), "\n")
print(summary_stats)

cat("\n\nLinear Model Results:\n")
cat(strrep("-", 60), "\n")
cat("Model: sepal_length ~ petal_length + species\n\n")
print(model_tidy)

cat("\nModel Fit Statistics:\n")
print(model_glance)

cat("\n", strrep("=", 60), "\n")
cat("Analysis complete!\n")

sink()

cat("Analysis results saved to results/model_summary.txt\n")

# Read cleaned data
#iris_clean <- read_csv("data/iris_clean.csv", show_col_types = FALSE)



# Plot 1: Sepal Length vs Petal Length
p1 <- ggplot(iris_clean, aes(x = petal_length, y = sepal_length, color = species)) +
  geom_point(size = 3) +
  labs(
    x = "Petal Length (cm)",
    y = "Sepal Length (cm)",
    title = "Sepal vs Petal Length"
  ) +
  theme_minimal()

# Plot 2: Sepal Width vs Petal Width
p2 <- ggplot(iris_clean, aes(x = petal_width, y = sepal_width, color = species)) +
  geom_point(size = 3) +
  labs(
    x = "Petal Width (cm)",
    y = "Sepal Width (cm)",
    title = "Sepal vs Petal Width"
  ) +
  theme_minimal()

# Plot 3: Distribution of Sepal Length by Species
p3 <- ggplot(iris_clean, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Sepal Length (cm)",
    title = "Sepal Length by Species"
  ) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: Distribution of Petal Length by Species
p4 <- ggplot(iris_clean, aes(x = species, y = petal_length, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Petal Length (cm)",
    title = "Petal Length by Species"
  ) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) +
  theme_minimal() +
  theme(legend.position = "none")

# Save all plots in a 2x2 grid
joined_plot <- (p1 | p2) / (p3 | p4) +
  plot_layout(guides = "collect")  # Collect all legends into one


ggsave("figures/iris_plot.png", joined_plot, width = 12, height = 8, dpi = 150)

cat("Plot saved to figures/iris_plot.png\n")

Real Example: Data Analysis Pipeline

Let’s build a complete analysis pipeline step by step, moving code out of the earlier gigantic R script.

Our pipeline:

  1. Download iris dataset from the web
  2. Clean and filter the data
  3. Run statistical analysis
  4. Generate visualization
  5. (Optional) Create final report with Rmd

Step 1: Download Data

scripts/download_data.R
dest <- "data/iris.csv"
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
download.file(url, dest, quiet = FALSE)

First rule in our Makefile:

data/iris.csv: scripts/download_data.R
    Rscript scripts/download_data.R

Run it:

make data/iris.csv

Run it again:

make data/iris.csv
# make: 'data/iris.csv' is up to date.

Step 2: Clean the Data

scripts/clean_data.R
# NOTE: file paths are hardcoded here for now; eventually they could be
# turned into command-line arguments

suppressPackageStartupMessages(library(dplyr))
library(readr)

# Read raw data
iris_data <- read_csv("data/iris.csv", show_col_types = FALSE)

# Clean the data
iris_clean <- iris_data %>%
  tidyr::drop_na() %>%              # Remove rows with missing values
  rename_with(tolower) %>%          # Standardize column names to lowercase
  mutate(id = row_number())         # Add a unique ID column

# Write cleaned data
write_csv(iris_clean, "data/iris_clean.csv")

cat("Data cleaning complete!\n")
cat(sprintf("  Original rows: %d\n", nrow(iris_data)))
cat(sprintf("  Cleaned rows: %d\n", nrow(iris_clean)))

Add a second rule:

data/iris_clean.csv: data/iris.csv scripts/clean_data.R
    Rscript scripts/clean_data.R

What this means:

  • To create data/iris_clean.csv
  • You need data/iris.csv AND scripts/clean_data.R
  • Run the R script to process it

If you edit clean_data.R, Make knows to re-run this step!

Step 3: Statistical Analysis

scripts/analyze.R
# analyze.R - Perform statistical analysis on iris dataset (tidyverse style)

suppressPackageStartupMessages(library(dplyr))
library(readr)
library(broom)

# Read cleaned data
iris_clean <- read_csv("data/iris_clean.csv")

# 1. Summary statistics by species
summary_stats <- iris_clean %>%
  group_by(species) %>%
  summarise(
    across(c(sepal_length, sepal_width, petal_length, petal_width),
           list(mean = mean, sd = sd),
           .names = "{.col}_{.fn}")
  )

# 2. Fit a simple linear model
model <- lm(sepal_length ~ petal_length + species, data = iris_clean)
model_tidy <- tidy(model)
model_glance <- glance(model)

# Create output file (the path must match the Makefile target)
sink("results/statistics.txt")

cat(strrep("=", 60), "\n")
cat("IRIS DATASET ANALYSIS\n")
cat(strrep("=", 60), "\n\n")

cat("Summary Statistics by Species:\n")
cat(strrep("-", 60), "\n")
print(summary_stats)

cat("\n\nLinear Model Results:\n")
cat(strrep("-", 60), "\n")
cat("Model: sepal_length ~ petal_length + species\n\n")
print(model_tidy)

cat("\nModel Fit Statistics:\n")
print(model_glance)

cat("\n", strrep("=", 60), "\n")
cat("Analysis complete!\n")

sink()

cat("Analysis results saved to results/statistics.txt\n")

Add our third rule:

results/statistics.txt: data/iris_clean.csv scripts/analyze.R
    mkdir -p results
    Rscript scripts/analyze.R

Chain of dependencies:

iris.csv -> iris_clean.csv -> statistics.txt

Make automatically handles the chain!
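The chaining can be sketched with a toy two-stage Makefile (assumes GNU Make is installed; a.txt, b.txt, and c.txt are stand-ins for the real pipeline files, run in a scratch directory):

```shell
cd "$(mktemp -d)"
# c.txt needs b.txt, which in turn needs a.txt (\t marks the tab-indented recipes)
printf 'c.txt: b.txt\n\tcp b.txt c.txt\nb.txt: a.txt\n\tcp a.txt b.txt\n' > Makefile
echo "data" > a.txt
make c.txt   # builds b.txt first, then c.txt
touch a.txt  # update the root of the chain
make c.txt   # both b.txt and c.txt are rebuilt automatically
```

Asking for the final target is enough; Make walks the dependency chain for you.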

Step 4: Generate Figures

scripts/make_figures.R
#!/usr/bin/env Rscript
# make_figures.R - Create visualizations from iris analysis (tidyverse style)

library(ggplot2)
library(patchwork)
library(readr)
library(magrittr)

# Read cleaned data
iris_data <- read_csv("data/iris_clean.csv", show_col_types = FALSE)



# Plot 1: Sepal Length vs Petal Length
p1 <- ggplot(iris_data, aes(x = petal_length, y = sepal_length, color = species)) +
  geom_point(size = 3) +
  labs(
    x = "Petal Length (cm)",
    y = "Sepal Length (cm)",
    title = "Sepal vs Petal Length"
  ) +
  theme_minimal()

# Plot 2: Sepal Width vs Petal Width
p2 <- ggplot(iris_data, aes(x = petal_width, y = sepal_width, color = species)) +
  geom_point(size = 3) +
  labs(
    x = "Petal Width (cm)",
    y = "Sepal Width (cm)",
    title = "Sepal vs Petal Width"
  ) +
  theme_minimal()

# Plot 3: Distribution of Sepal Length by Species
p3 <- ggplot(iris_data, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Sepal Length (cm)",
    title = "Sepal Length by Species"
  ) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: Distribution of Petal Length by Species
p4 <- ggplot(iris_data, aes(x = species, y = petal_length, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Petal Length (cm)",
    title = "Petal Length by Species"
  ) +
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) +
  theme_minimal() +
  theme(legend.position = "none")

# Save all plots in a 2x2 grid
joined_plot <- (p1 | p2) / (p3 | p4) +
  plot_layout(guides = "collect")  # Collect all legends into one

output_plot_path <- "figures/iris_plot.png"
ggsave(output_plot_path, joined_plot, width = 12, height = 8, dpi = 150)

cat("Plot saved to ", output_plot_path, " \n")

Add our fourth rule:

figures/iris_plot.png: results/statistics.txt scripts/make_figures.R
    mkdir -p figures
    Rscript scripts/make_figures.R

Now we have a full pipeline:

iris.csv -> iris_clean.csv -> statistics.txt -> iris_plot.png

Try it out yourself!

Watch how Make automatically figures out which files need to be “re-run”

Complete Makefile Example

# Download data
data/iris.csv: scripts/download_data.R
    Rscript scripts/download_data.R

# Clean data
data/iris_clean.csv: data/iris.csv scripts/clean_data.R
    Rscript scripts/clean_data.R

# Analyze
results/statistics.txt: data/iris_clean.csv scripts/analyze.R
    mkdir -p results
    Rscript scripts/analyze.R

# Plot
figures/iris_plot.png: results/statistics.txt scripts/make_figures.R
    mkdir -p figures
    Rscript scripts/make_figures.R

Testing Your Pipeline

make figures/iris_plot.png

Make will:

  1. Check if data/iris.csv exists -> download if needed
  2. Check if data/iris_clean.csv is up-to-date -> clean if needed
  3. Check if results/statistics.txt is up-to-date -> analyze if needed
  4. Check if figures/iris_plot.png is up-to-date -> plot if needed

Edit scripts/make_figures.R and run again:

  • Only re-runs the plotting step

Phony Targets: Special Rules

Some targets don’t create actual files:

.PHONY: clean all

all: figures/iris_plot.png

clean:
    rm -f data/iris_clean.csv
    rm -f results/statistics.txt
    rm -f figures/iris_plot.png

.PHONY tells Make these aren’t real files

Usage:

make all    # Build everything
make clean  # Remove generated files, start fresh

Why .PHONY? Without it, if a file named “clean” exists, make clean won’t run!
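A quick sketch of the pitfall (assumes GNU Make is installed; run in a scratch directory):

```shell
cd "$(mktemp -d)"
# A "clean" rule with NO .PHONY declaration
printf 'clean:\n\techo cleaning\n' > Makefile
touch clean          # a file literally named "clean" exists
make clean           # reports 'clean' is up to date; the recipe is skipped!
# Now declare the target phony
printf '.PHONY: clean\nclean:\n\techo cleaning\n' > Makefile
make clean           # the recipe runs and prints "cleaning"
```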

When Should You Use Make?

✅ Use Make when:

  • Long-running computations (don’t want to rerun everything)
  • Multiple tools involved (R + Python + shell scripts)
  • Team collaboration (document the workflow clearly)

📝 Stick with simple R scripts when:

  • Single analysis that fits in one script
  • Very quick computations (seconds)
  • No intermediate files to track

Best Practices

  1. One Makefile per project in the root directory
  2. Use meaningful target names (not temp1.csv)
  3. Comment your Makefile to explain steps
  4. Use variables for paths and commands
  5. Add .PHONY for non-file targets

Common Pitfalls

Using spaces instead of tabs

target: dependency
  command  # WRONG! This is spaces
  command  # RIGHT! This is a tab

Circular dependencies

a.txt: b.txt
    cat b.txt > a.txt

b.txt: a.txt  # CIRCULAR!
    cat a.txt > b.txt

Debugging Your Makefile

See what Make would do (without running):

make -n target  # "Dry run" - prints commands only

Understand why Make rebuilds something:

make -d target  # Debug mode with verbose output

Force a complete rebuild:

make -B target  # Rebuild everything, ignore timestamps

Pro tip

Start with make -n to check your Makefile logic before running expensive computations!
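For example, a dry run prints the recipe without touching any files (a minimal sketch, assuming GNU Make is installed; in.txt and out.txt are toy files):

```shell
cd "$(mktemp -d)"
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > Makefile
echo "hi" > in.txt
make -n out.txt              # prints the command without executing it
test ! -f out.txt && echo "nothing was built"
```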

Resources and Acknowledgements

Official Documentation:

This workshop is based on:

Key Takeaways

Remember these core concepts:

✅ Make tracks dependencies between files automatically

✅ Make only rebuilds what changed (massive time savings!)

✅ Makefiles document your entire workflow in one place

✅ Works with any tools (R, Python, shell scripts, etc.)

Essential for reproducible research workflows

Questions? 🤔

Thank you for attending!

Key message: Make = Your research pipeline’s documentation + automation

Think of your Makefile as a lab notebook for computational work

Start using Make in your next project!

If Time Allows

Making Makefile Readable with Variables

# Define variables at the top
DATA_DIR = data
SCRIPTS = scripts
RESULTS = results
FIGURES = figures

# Use variables in rules
$(DATA_DIR)/iris_clean.csv: $(DATA_DIR)/iris.csv $(SCRIPTS)/clean_data.R
    Rscript $(SCRIPTS)/clean_data.R

$(FIGURES)/iris_plot.png: $(RESULTS)/statistics.txt $(SCRIPTS)/make_figures.R
    mkdir -p $(FIGURES)
    Rscript $(SCRIPTS)/make_figures.R

  • Easier to read and maintain
  • Less prone to typos

More Convenience Targets

Sometimes you just want everything packaged up to send to your collaborators right after an edit

collaboration-packet.zip: data-wrangling.R raw-data.csv paper.Rmd comments.txt
    Rscript data-wrangling.R clean-data.csv
    zip collaboration-packet.zip clean-data.csv paper.Rmd comments.txt

Optional: Installing GNU Make on Windows

Option 1: Using Chocolatey (Recommended)

  1. Install Chocolatey (if not already installed):

    • Right-click Start menu → “Windows PowerShell (Admin)”
    • Visit: https://chocolatey.org/install
    • Copy and run the installation command from the website
  2. Install GNU Make:

    choco install make
  3. Verify installation:

    make --version
    # Should show: GNU Make 4.x

Option 2: Using Git for Windows (Built-in)

  • Download and install Git for Windows: https://gitforwindows.org/
  • Make is included! Access via Git Bash terminal
  • Look for “Git Bash” in your Start menu

Troubleshooting:

  • If make command not found → restart your terminal/PowerShell
  • Use Git Bash instead of Command Prompt for Option 2

Optional: Installing GNU Make on macOS

Option 1: Using Homebrew (Recommended)

  1. Install Homebrew (if not already installed):

    • Open Terminal (Applications → Utilities → Terminal)
    • Visit: https://brew.sh/
    • Copy and run the installation command:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Install GNU Make:

    brew install make
  3. Verify installation:

    make --version
    # Should show: GNU Make 4.x

Option 2: Xcode Command Line Tools (Built-in)

GNU Make comes with Xcode Command Line Tools:

xcode-select --install

Click “Install” in the popup window

Troubleshooting:

  • If make --version shows an older version → use brew install make
  • Homebrew installs GNU Make as gmake on some systems → create an alias or use gmake

References

Stallman, Richard et al. 1998. “The GNU Project.”